SL3 Neural Network
1. Perceptron concepts
1. Concepts:
- transfer function, activation function, weight, threshold
- Boolean functions as perceptrons (the unit fires when sum(w_i * x_i) >= theta): AND(w1=1, w2=1, 1 < theta <= 2), OR(w1=1, w2=1, 0 < theta <= 1), NOT(w1=-1, theta=0). XOR is not linearly separable, so it needs a second unit: feed AND(x1, x2) in as a third input with weight -2 (w1=1, w2=1, w_AND=-2, theta=1); see the sketch at the end of this section
- A perceptron defines a hyperplane decision boundary in n-dimensional input space
2. How to find perceptron weights?:
- Perceptron Rule
- uses the thresholded output y_hat; guaranteed to converge only on linearly separable data
- delta w_i = (y - y_hat) * x_i, where the error term (y - y_hat) is +1, -1, or 0
- learning rate: the whole update is multiplied by a small learning rate lr to keep the steps small
- Gradient Descent
- uses the unthresholded activation a; works even when the data are not linearly separable
- What is the gradient? The generalization of the derivative to several variables; it points in the direction of fastest increase of the function
- error function to be minimized: E(w) = 1/2 * sum_d (y_d - a_d)^2, where a = sum_i w_i * x_i
3. Sigmoid function:
- Why: it makes the mapping from input to output differentiable by smoothing out the hard threshold
- What: sigmoid(a) = 1 / (1 + e^(-a))
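
A minimal Python sketch tying the above together. This is only a sketch: the firing rule sum(w_i * x_i) >= theta, the particular theta values chosen inside the ranges above, and the learning rate lr are all illustrative assumptions, not fixed by the notes.

```python
import numpy as np

def perceptron_output(x, w, theta):
    """Threshold unit: outputs 1 when the weighted sum reaches theta, else 0."""
    return 1 if np.dot(w, x) >= theta else 0

# Boolean gates as perceptrons (theta values picked inside the valid ranges):
AND = lambda x1, x2: perceptron_output([x1, x2], [1, 1], theta=1.5)
OR  = lambda x1, x2: perceptron_output([x1, x2], [1, 1], theta=0.5)
NOT = lambda x1:     perceptron_output([x1],     [-1],   theta=0)
# XOR is not linearly separable, so it reuses the AND unit as a third input.
XOR = lambda x1, x2: perceptron_output([x1, x2, AND(x1, x2)], [1, 1, -2], theta=1)
# e.g. XOR(1, 0) -> 1, XOR(1, 1) -> 0

def perceptron_rule_update(w, x, y, theta, lr=0.1):
    """One perceptron-rule step: uses the *thresholded* output y_hat."""
    y_hat = perceptron_output(x, w, theta)
    return np.asarray(w) + lr * (y - y_hat) * np.asarray(x)  # (y - y_hat) is +1, -1, or 0

def gradient_descent_update(w, x, y, lr=0.1):
    """One gradient-descent step on E(w) = 1/2 * (y - a)^2 with a = w . x."""
    a = np.dot(w, x)                                          # unthresholded activation
    return np.asarray(w) + lr * (y - a) * np.asarray(x)       # -dE/dw_i = (y - a) * x_i

def sigmoid(a):
    """Differentiable stand-in for the hard threshold, mapping into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))
```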
2. ANN & Backpropagation
1. ANN
- A perceptron is a single-layer neural network
- The underlying principle behind an ANN is to divide the problem across many units and recombine their outputs through weighted connections
2. Backpropagation:
- Weights are randomly initialized
- The output of each node is computed as sigmoid(sum_i w_i * x_i)
- Compute the error between the true output and the network's final output
- The error is passed back layer by layer, and the weights are adjusted using these propagated errors (see the sketch below)
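
A compact sketch of these steps, under assumptions the notes do not fix: a 2-2-1 layer layout, squared-error loss, sigmoid units throughout, and a learning rate of 0.5.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(2, 2))   # small random initial weights
W2 = rng.normal(scale=0.1, size=(1, 2))

def train_step(x, y, lr=0.5):
    """One forward + backward pass on a single (x, y) example."""
    global W1, W2
    # Forward pass: each node outputs sigmoid(sum of weighted inputs).
    h = sigmoid(W1 @ x)
    y_hat = sigmoid(W2 @ h)
    # Error at the output, scaled by the sigmoid derivative s * (1 - s).
    delta_out = (y - y_hat) * y_hat * (1 - y_hat)
    # Pass the error back one layer to get each hidden node's share.
    delta_hid = (W2.T @ delta_out) * h * (1 - h)
    # Adjust the weights using the propagated errors.
    W2 += lr * np.outer(delta_out, h)
    W1 += lr * np.outer(delta_hid, x)
    return float(y_hat[0])

# Usage: one update on a single training example.
train_step(np.array([1.0, 0.0]), y=1)
```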
3. Bias of ANN
1. Restriction Bias: the representational power of the network structure, i.e., the set of hypotheses we allow ourselves to consider.
1. Linear (linearly separable) functions: a single perceptron
2. Boolean: a network of threshold-like units
3. Continuous: a single hidden layer with enough hidden units
4. Arbitrary: two (or more) hidden layers
Note: ANNs are not limited to Boolean functions: with enough hidden units, a single hidden layer can also represent any continuous function, and two hidden layers can represent arbitrary functions by modeling the continuous pieces first and then adding the jumps at the seams between patches.
2. Preference Bias: given two representations the algorithm can express, which one it prefers.
1. Choose initial weights to be small random values
1. Random: provides variability, which helps avoid getting stuck in the same local minimum
2. Small: reduces complexity and helps avoid overfitting (see the sketch after this list)
2. Prefer simpler, more generalizable representations --> Occam's razor
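
A small sketch of why small initial weights mean low complexity (the scales 0.1 and 10.0 and the random seed are just illustrative choices): with small weights the activation stays near zero, where the sigmoid is roughly linear, so the unit starts out behaving simply; with large weights it saturates.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)
x = rng.normal(size=10)                   # one arbitrary input vector

small_w = rng.normal(scale=0.1, size=10)  # small random initial weights
large_w = rng.normal(scale=10.0, size=10) # large weights, for contrast

# Small weights keep the activation near 0, where the sigmoid is nearly
# linear; large weights push it into the saturated, highly non-linear regime.
print(sigmoid(small_w @ x))   # typically close to 0.5
print(sigmoid(large_w @ x))   # typically close to 0 or 1
```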
4. How to avoid overfitting?
- Restrict the network to a bounded number of layers and nodes, and keep the weight values small
- Use cross-validation to decide these parameters (see the sketch below)
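
One way this could look in code (the use of scikit-learn, the toy dataset, and the specific grid values are my own choices, not from the notes): cross-validation picks the hidden-layer size and the L2 penalty alpha, which together bound the number of nodes and keep the weights small.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)  # toy data

search = GridSearchCV(
    MLPClassifier(max_iter=2000, random_state=0),
    param_grid={
        "hidden_layer_sizes": [(2,), (5,), (10,)],  # bounded number of nodes
        "alpha": [1e-4, 1e-2, 1e-1],                # L2 penalty keeps weights small
    },
    cv=5,  # 5-fold cross-validation
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```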